\input macro[1,mps]
\magnification\magstephalf
\baselineskip 14pt
\leftline{\sevenrm A01.tex[365,mps]\today}
\leftline{\copyright Robert W. Floyd, November 3, 1988}
\leftline{\sevenrm Unpublished Draft}
\def\th{↑{\rm th}}
\def\and{\mathbin{\rm\&}}
\bigskip
\noindent{\bf Average Time Analysis of Coalesced Hashing}
\medskip
\noindent{\bf Table Search Lemmas (TSL)}
\medskip
In a table of size $s$, there are $u$ unoccupied cells in random positions.
The number of arrangements of unoccupied cells is $s\choose u$. The number
of arrangements in which the $i\th$ cell is the first empty one is
${s-i} \choose {u-1}$. Then $P[i\th \hbox { cell is first empty}]$ is
${s-i \choose u-1}/{s \choose u}$. Let the random variable $q$ be the
location of the first empty cell. Then
$$\eqalign{E[q↑{\overline p}] &= \sum↓{i \geq 1} i↑{\overline p}
{s-i\choose u-1}\bigg/
{s \choose u} = {p!\over{s \choose u}} \sum↓{i \geq 1} {i+p-1 \choose p}
{s-i \choose u-1}\cr
& = p! {p+s \choose p+u}\bigg/{s \choose u}
= p! {(s+1)↑{\overline p} \over (u+1)↑{\overline p}}
\quad \hbox {(also, } p! {(s+p)↑{\underline p} \over (u+p)↑{\underline p}}).\cr}$$
Similarly, the number of arrangements where the $i\th$ cell is the last empty one
is ${i-1 \choose u-1}$. Let $q$ be the location of the last empty cell. Then
$$E[q↑{\overline p}] = \sum↓{i\leq s} i↑{\overline p} {i-1\choose u-1}\bigg/
{s\choose u} = {u↑{\overline p} \over {s\choose u}} \sum↓{i \leq s}
{i+p-1 \choose u+p-1}$$
\noindent which by upper summation [GKP pg. 174] is
$${u↑{\overline p} {s+p \choose u+p}\bigg/{s\choose u}}= {u\over u+p}
(s+1)↑{\overline p}.$$
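Both lemmas can be checked exhaustively for small tables by enumerating all $s\choose u$ placements of the empty cells; a Python sketch with exact rational arithmetic (the function names are mine, not the draft's):

```python
# Exhaustive check of the Table Search Lemmas: average q^(rising p) over
# all C(s,u) placements of u empty cells among s cells, where q is the
# first (resp. last) empty cell, against the closed forms derived above.
from itertools import combinations
from fractions import Fraction
from math import comb, factorial

def rising(x, p):
    """Rising factorial x^(overline p) = x(x+1)...(x+p-1)."""
    out = 1
    for k in range(p):
        out *= x + k
    return out

def moment(s, u, p, pick):
    """E[pick(empties)^(rising p)] over uniform placements of u empties."""
    total = sum(rising(pick(e), p) for e in combinations(range(1, s + 1), u))
    return Fraction(total, comb(s, u))

def first_closed(s, u, p):    # p! (s+1)^(rising p) / (u+1)^(rising p)
    return Fraction(factorial(p) * rising(s + 1, p), rising(u + 1, p))

def last_closed(s, u, p):     # u/(u+p) * (s+1)^(rising p)
    return Fraction(u, u + p) * rising(s + 1, p)

for s in range(1, 8):
    for u in range(1, s + 1):
        for p in range(4):
            assert moment(s, u, p, min) == first_closed(s, u, p)
            assert moment(s, u, p, max) == last_closed(s, u, p)
print("table search lemmas verified for s < 8")
```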
\medskip
\noindent{\bf Conditional Expected Value Theorem (CEVT)}
\medskip
Assume random variables $X, Y$ have a joint distribution. Define $f(y) =
E[\phi (X,Y)\mid {Y=y}]$. Then $E[\phi (X,Y)] = E[f(Y)]$. If $X$ is
numerical, setting $\phi (X,Y) = X$ gives
$$f(y) = E[X\mid Y =y],\ E[X] = E[f(Y)].$$
If $f$ is linear, $E[X] = E[f(Y)] = f(E[Y])$.
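The tower property can be illustrated on a small finite case; the joint distribution table below is a hypothetical example of mine, not from the draft:

```python
# A numeric instance of the CEVT: for a finite joint distribution of
# (X, Y), E[phi(X,Y)] equals the Y-average of the conditional expectation
# f(y) = E[phi(X,Y) | Y = y].
from fractions import Fraction

# hypothetical joint distribution P[(x, y)]; any table summing to 1 works
joint = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
         (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}
phi = lambda x, y: x * (y + 2)

direct = sum(p * phi(x, y) for (x, y), p in joint.items())

p_y = {}                                   # marginal distribution of Y
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, Fraction(0)) + p

def f(y):                                  # conditional expectation given Y = y
    return sum(p * phi(x, yy) for (x, yy), p in joint.items() if yy == y) / p_y[y]

via_cevt = sum(p_y[y] * f(y) for y in p_y)
assert direct == via_cevt
print(direct)
```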
\medskip
\noindent{\bf Conditional Variance Theorem [Unused]}
$$V[\phi (X,Y)] = E[V[\phi (X,Y)\mid X=x]] + V [E [\phi (X,Y)\mid X=x]].$$
\noindent{\bf Linear Recurrence Lemma (LRL) [Used]}
\medskip
One solution of $x↓{t+1} = ax↓t + bt+c\>(a\not= 1)$ is $x↓t = {1\over 1-a}
(bt+c - {b\over 1-a})$. For the general solution, add a multiple of the
solution $x↓t = a↑t$ of the homogeneous $x↓{t+1} = ax↓t$. For
$x↓0 = 0$, the solution is $x↓t = {1\over 1-a}(bt+(c - {b\over 1-a})(1-a↑t))$.
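The closed form with $x↓0 = 0$ can be confirmed by iterating the recurrence directly, here in exact rationals for an arbitrarily chosen $a$, $b$, $c$:

```python
# Check of the Linear Recurrence Lemma: for x_{t+1} = a*x_t + b*t + c with
# a != 1 and x_0 = 0, the solution is
#   x_t = (1/(1-a)) * (b*t + (c - b/(1-a)) * (1 - a^t)).
from fractions import Fraction

def closed_form(a, b, c, t):
    return (b * t + (c - b / (1 - a)) * (1 - a ** t)) / (1 - a)

a, b, c = Fraction(3, 2), Fraction(2), Fraction(-5)   # sample coefficients
x = Fraction(0)
for t in range(20):
    assert x == closed_form(a, b, c, t)
    x = a * x + b * t + c
print("linear recurrence lemma verified")
```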
\vfill\eject
\noindent{\bf Eigenvector Lemma}
\medskip
Let $\{Y↓i\}$ be vectors, $T$ a linear transformation with $T(Y↓i) = d↓iY↓i$.
If $X$ is a vector for which $T(X) = \alpha X + \sum\limits↓i\beta↓iY↓i$,
$\alpha\not=$ any $d↓i$,
then another eigenvector of $T$ is $Z = X + \sum\limits↓i
{\beta↓i Y↓i \over \alpha - d↓i}$; $T(Z) = \alpha Z$.
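A concrete instance in $R↑3$, with pure-Python matrix arithmetic (the particular numbers are mine, chosen so that $\alpha\not=d↓i$):

```python
# Numeric instance of the Eigenvector Lemma.  T has eigenvectors Y_1 = e1,
# Y_2 = e2 with eigenvalues d1, d2, and maps X = e3 to
# alpha*X + beta1*Y_1 + beta2*Y_2.  The lemma predicts that
# Z = X + sum_i beta_i Y_i / (alpha - d_i) satisfies T(Z) = alpha*Z.
from fractions import Fraction as F

d1, d2, alpha = F(2), F(5), F(3)
beta1, beta2 = F(7), F(-4)
# columns of T are T(e1), T(e2), T(e3)
T = [[d1, 0, beta1],
     [0, d2, beta2],
     [0, 0, alpha]]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

X = [F(0), F(0), F(1)]
Z = [X[0] + beta1 / (alpha - d1), X[1] + beta2 / (alpha - d2), X[2]]
assert apply(T, Z) == [alpha * z for z in Z]
print("eigenvector lemma verified")
```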
\vskip 1in
Consider a typical situation after inserting $t$ keys by coalesced hashing into a
table of size $h$. The cellar pointer $C↓t$ helps save time in searching
for empty cells, all of which have indices less than $C↓t$; $C↓0 = h+1$. All
keys lie on one of several linked lists. To hash a new key with hash
address $a$, if that address is empty insert the key as a list of length 1.
If that address is full, follow the list to the end to confirm that the
key is not present. Move $C↓t$ left to the rightmost empty cell. Enter
the key there. Make the previous end of the list point to $C↓t$, extending the
list. [Reference Knuth V3, Vitter]
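The insertion procedure just described can be sketched as follows (a sketch only: the class and field names are mine, the hash address is passed explicitly, cells are 1-indexed, and `C` plays the role of $C↓t$):

```python
# Sketch of insertion for coalesced hashing with the empty-cell pointer C.
class CoalescedTable:
    def __init__(self, h):
        self.h = h
        self.key = [None] * (h + 1)    # key[1..h]; None marks an empty cell
        self.link = [0] * (h + 1)      # link[i] = next cell in list, 0 = end
        self.C = h + 1                 # all empty cells have index < C

    def insert(self, k, a):
        """Insert key k with hash address a (1 <= a <= h); return probes."""
        probes = 1
        if self.key[a] is None:        # empty: new list of length 1
            self.key[a] = k
            return probes
        while True:                    # follow the list to its end
            if self.key[a] == k:
                return probes          # key already present
            if self.link[a] == 0:
                break
            a = self.link[a]
            probes += 1
        self.C -= 1                    # move C left to rightmost empty cell
        while self.key[self.C] is not None:
            self.C -= 1
        self.key[self.C] = k           # enter the key there
        self.link[a] = self.C          # previous end now points to C
        return probes
```

Following a list from the hashed cell costs $i$ probes when that cell is the $i\th$ item from the end, matching the probe count charged to $B↓t$ below.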
We study the behavior of several random variables; for each $t$, these are
$C↓t↑{\overline p}$; $S↓t$ (the multiset of list lengths after $t$
insertions); $S↓t↑{\overline p} = \sum \limits↓{\sigma\in S↓t}
\sigma↑{\overline p}$, the
$p\th$ factorial power sum of $S↓t$; $B↓t$, the cumulative number of list
probes after $t$ insertions, and more generally $B↓t↑{\overline p}$.
First we find the conditional expected value $E[C↓{t+1}↑{\overline p} \mid
C↓t↑{\overline p} = \gamma]$. With probability ${h-t\over h}$, the $(t+1)↑{\rm st}$
key hashes to an empty cell and $C↓{t+1}↑{\overline p} = \gamma$. With
probability $t/h$, there is a collision, and the second table search
lemma, with $s = \gamma-1$, $u = h-t$, gives $C↓{t+1}↑{\overline p}$ an
expected value
of ${h-t\over h-t+p}\gamma$. Taking a weighted average,
$$\eqalign{E[C↓{t+1}↑{\overline p} \mid C↓t↑{\overline p} = \gamma] &= \gamma
\left({h-t\over h}+ {t (h-t)\over h(h-t+p)}\right )\cr
&= \gamma {(h+p)(h-t) \over h (h-t+p)}.\cr}$$
\noindent By the CEVT,
$$E\bigl[C↓{t+1}↑{\overline p}\bigr] = E\bigl[C↓t↑{\overline p}\bigr]
{(h+p)(h-t) \over h(h-t+p)}.$$
\noindent By an easy induction,
$$E\bigl[C↓t↑{\overline p}\bigr] = (h-t+1)↑{\overline p}\left({h+p\over
h}\right)↑t.$$
For fixed exponent $p$ and density $d = t/h$, asymptotically
$E[C↓t↑{\overline p}] \sim (h(1-d) e↑d)↑p$. The case $p=1$ is Knuth's
problem 6.4-41 [Knuth V3]. Using $p=1$ and $2$, the variance is
$$\eqalign {V[C↓t] &= E[C↓t↑{\overline 2}- C↓t] - E[C↓t]↑2\cr
&= \left({h+2\over h}\right)↑t (h-t+1)↑{\overline 2}- \left({h+1\over h}\right)↑t
(h-t+1) - \left({h+1\over h}\right)↑{2t}
(h-t+1)↑2.\cr}$$
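The induction can be checked numerically by iterating the recurrence for $E[C↓t↑{\overline p}]$ from $C↓0 = h+1$ in exact rationals (table size $h=10$ is an arbitrary choice):

```python
# Check of E[C_t^(rising p)] = (h-t+1)^(rising p) * ((h+p)/h)^t against the
# recurrence E[C_{t+1}^..] = E[C_t^..] * (h+p)(h-t) / (h(h-t+p)).
from fractions import Fraction

def rising(x, p):
    out = 1
    for k in range(p):
        out *= x + k
    return out

def expected_C_moment(h, t, p):
    """Closed form E[C_t^(rising p)]."""
    return rising(h - t + 1, p) * Fraction(h + p, h) ** t

h = 10
for p in range(1, 4):
    e = Fraction(rising(h + 1, p))        # C_0 = h + 1
    for t in range(h + 1):
        assert e == expected_C_moment(h, t, p)
        e *= Fraction((h + p) * (h - t), h * (h - t + p))
print("C_t moment formula verified")
```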
Next we want $E[S↓{t+1}↑{\overline p} \mid S↓t = {\cal S}]$. With
probability ${h-t\over h}$ there is no collision and a new list of length 1 is
inserted, $S↓{t+1}= {\cal S}\and 1$. For each $\sigma \in
{\cal S}$ and each $1 \leq i \leq \sigma$, there is probability $1/h$
of a collision with the $i\th$ item from the end of a list of length $\sigma$,
requiring $i$ probes in the list, and increasing the list length by one;
$S↓{t+1} = {\cal S}\overline{\and}\sigma \and(\sigma + 1)$. With probability
${h-t \over h}$, $S↓{t+1}↑{\overline p} = {\cal S}↑{\overline p}+ 1
↑{\overline p} = {\cal S}↑{\overline p}+ p!$. For each other case, with
probability $1/h$, a collision with the $i\th$ element from the end of
a list of length $\sigma$ gives
$$S↓{t+1}↑{\overline p}= {\cal S}↑{\overline p}- \sigma↑{\overline p} +
(\sigma + 1)
↑{\overline p} = {\cal S}↑{\overline p}+ p (\sigma + 1)↑{\overline {p-1}}.$$
Summing over $i$ for a fixed $\sigma$ gives $\sigma {\cal S}↑{\overline p}
+ p\sigma ↑{\overline p}$.
Summing over $\sigma$ and dividing by $h$ gives $(t {\cal S}↑{\overline p}
+ p {\cal S}↑{\overline p})/h =
(t+p){\cal S}↑{\overline p}/h$. Adding ${h-t\over h} ({\cal S}↑{\overline p}+
p!)$ gives
$$E[S↓{t+1}↑{\overline p}\mid S↓t = {\cal S}] = {h+p\over h}
{\cal S}↑{\overline p}
+ {h-t\over h} p!.$$
\medskip
\noindent By the CEVT,
$$E[S↓{t+1}↑{\overline p}] = {h+p\over h} E[S↓t↑{\overline p}] + {h-t\over h}
p!.$$
Applying the linear recurrence lemma, with particular solution $p!
({h(1-p)\over p↑2} + {t\over p})$ and homogeneous solution $({h+p\over h})↑t$,
gives
$$E[S↓t↑{\overline p}] = p! \left ({h(1-p)\over p↑2} \left (1 -\left({h+p\over h}
\right)↑t\right)
+ {t\over p}\right)
\quad \hbox{for } p \not= 0.$$
In particular ($p=0$ gives $a=1$ in the recurrence, so $E[S↓t↑{\overline 0}]$
comes from summing ${h-t\over h}$ directly):
$$\eqalign{E[S↓t↑{\overline 0}] &= {t(2h-t+1) \over 2h} \sim h(d -d↑2/2)\cr
E[S↓t↑{\overline 1}]&= t \sim d h\cr
E[S↓t↑{\overline 2}]&= {h \over 2} \left(\left({h+2 \over h}\right)↑t-1\right)
+ t \sim h \left({e↑{2d}-1 \over 2} + d \right).\cr}$$
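The general closed form (and hence the particular cases $p=1,2$) can be checked against the recurrence for $E[S↓t↑{\overline p}]$, iterated from the empty table:

```python
# Check of E[S_t^(rising p)] for p != 0 against the recurrence
# E[S_{t+1}^..] = ((h+p)/h) E[S_t^..] + ((h-t)/h) p!, with E[S_0^..] = 0.
from fractions import Fraction
from math import factorial

def expected_S_moment(h, t, p):
    """Closed form E[S_t^(rising p)] for p != 0."""
    a = Fraction(h + p, h)
    return factorial(p) * (Fraction(h * (1 - p), p * p) * (1 - a ** t)
                           + Fraction(t, p))

h = 9
for p in range(1, 4):
    e = Fraction(0)                        # S_0 is the empty multiset
    for t in range(h + 1):
        assert e == expected_S_moment(h, t, p)
        e = Fraction(h + p, h) * e + Fraction(h - t, h) * factorial(p)
print("S_t moment formula verified")
```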
A list of length $\sigma$ represents a history of collisions; the cumulative
number of list probes they required has expected value
$$(\sigma + 2)(\sigma - 1)/4 = (\sigma↑{\overline 2} - 2 \sigma↑{\overline 0})/4.$$
Then $E[B↓t\mid S↓t = {\cal S}] = ({\cal S}↑{\overline 2}- 2 {\cal S}
↑{\overline 0})/4$.
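This expectation follows by summing the cost of each extension: growing a list from length $k$ to $k+1$ hits the $i\th$ item from the end with $i$ uniform on $1..k$, so it costs $(k+1)/2$ probes on average. A short check:

```python
# Check that growing a list to length sigma accumulates an expected
# (sigma+2)(sigma-1)/4 probes: each extension of a length-k list costs
# (k+1)/2 expected probes.
from fractions import Fraction

def expected_probes(sigma):
    """Expected total probes while a list grows to length sigma."""
    return sum(Fraction(k + 1, 2) for k in range(1, sigma))

for sigma in range(1, 40):
    assert expected_probes(sigma) == Fraction((sigma + 2) * (sigma - 1), 4)
print("collision-history expectation verified")
```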
\medskip
\noindent By the CEVT,
$$\eqalign{E[B↓t] &= E[S↓t↑{\overline 2}]/4 - E[S↓t↑{\overline 0}]/2\cr
&= {h \over 8} \left (\left ({h+2\over h}\right)↑t-1\right ) - {t\over 4h}
(h-t+1).\cr}$$
Let's try to find $E[B↓{t+1}↑{\overline p}]$, conditioned on $B↓t = {\cal B}$ and
$S↓t = {\cal S}$. With probability $h-t\over h$, $B↓{t+1}↑{\overline p}=
{\cal B}↑{\overline p}$. For each $\sigma \in {\cal S}$, $1 \leq i \leq \sigma$,
there are $i$ collisions with probability $1/h$, and $B↓{t+1}↑{\overline p} =
({\cal B}+ i)↑{\overline p}$. Then
$$E[B↓{t+1}↑{\overline p} \mid B↓t = {\cal B}, S↓t = {\cal S}] = {h-t\over h} {\cal B}↑
{\overline p}+
{1\over h} \sum↓{1\leq i\leq\sigma\in{\cal S}}({\cal B}+ i)↑{\overline p}.$$
This is hard for general $p$. For $p=1$,
$$\eqalign{E[B↓{t+1} \mid B↓t = {\cal B}, S↓t = {\cal S}] &= {h-t\over h} {\cal B}+ {1\over h}
\sum↓{1\leq i \leq\sigma\in{\cal S}}({\cal B}+i)\cr
&={\cal B}+ {1\over h} \sum↓{\sigma\in{\cal S}} {\sigma↑{\overline 2} \over 2}\cr
&= {\cal B}+ {1\over 2h} {\cal S}↑{\overline 2}.\cr}$$
$$\eqalign{E[B↓{t+1}] &= E[B↓t] + {1\over 2h} E[S↓t↑{\overline 2}]\cr
&= E[B↓t] + {\left ({h+2\over h}\right )↑t -1\over 4} + {t\over 2h};\cr}$$
$$\eqalign {E[B↓t] &= {h\over 8} \left (\left ({h+2\over h}\right )↑t -1\right )
- t/4 + {t(t-1)\over 4h}\cr
& \sim {h\over 8} \left (e↑{2d}- 1 - 2d + 2d↑2\right)\cr}$$
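The closed form for $E[B↓t]$ can be checked against the recurrence above, again in exact rationals ($h=11$ is arbitrary):

```python
# Check that iterating E[B_{t+1}] = E[B_t] + (((h+2)/h)^t - 1)/4 + t/(2h)
# from B_0 = 0 reproduces
#   E[B_t] = (h/8)(((h+2)/h)^t - 1) - t/4 + t(t-1)/(4h).
from fractions import Fraction

def expected_B(h, t):
    """Closed form E[B_t]."""
    a = Fraction(h + 2, h)
    return (Fraction(h, 8) * (a ** t - 1) - Fraction(t, 4)
            + Fraction(t * (t - 1), 4 * h))

h = 11
a = Fraction(h + 2, h)
b = Fraction(0)                            # B_0 = 0
for t in range(h + 1):
    assert b == expected_B(h, t)
    b += (a ** t - 1) / 4 + Fraction(t, 2 * h)
print("B_t formula verified")
```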
\noindent Challenge: solve for $E\bigl[B↑{\overline 2}↓t\bigr]$.
\bye